Notes
Coda -k and onset k are not treated as the same k when determining unigram frequencies (and therefore pointwise mutual information). This is partly because what we are interested in is positional stickiness, and partly because they are arguably different (phonetic) segments.
Key to tables
possible is the count of possible syllables of this shape. What counts as a “possible” syllable? There are different ways to define this; here we assume:
ʔ but excluding w; we distinguish orthographic d gi in addition to s x)
[aː e əː ɛ i ɨ ɔ o u iə ɨə uə] with unrestricted distribution following plain onsets
[a ə] that cannot occur in open syllables
[ɗw tw tʰw sw zw lw rw cw ʂw ɲw ʈw kw xw ɣw ŋw hw w] (we treat w here like a labialized ʔw for co-occurrence reasons) which may not be followed by [ɨ ɔ o u ɨə uə] (ostensibly the single exception is quốc but it is typically pronounced [kwək])
[m n ŋ] and 3 unreleased plosive codas [p t k]
[w j] with restricted distribution: [j] cannot follow [i iə e ɛ] and [w] cannot follow [əː ɔ o u uə]
SV and NSV are the counts of syllables of these shapes in the SV and NSV lists, respectively.
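The co-occurrence restrictions listed in this key can be sketched as a small filter over vowel and coda combinations. This is only an illustration under stated assumptions: the function and variable names are hypothetical, tone is ignored, the plain-onset inventory is left out because it is not fully given here, and the result is not meant to reproduce the figures in the possible column.

```python
# Inventories copied from the key; names are hypothetical.
LONG_VOWELS = ["aː", "e", "əː", "ɛ", "i", "ɨ", "ɔ", "o", "u", "iə", "ɨə", "uə"]
SHORT_VOWELS = ["a", "ə"]          # cannot occur in open syllables
CODAS = ["m", "n", "ŋ", "p", "t", "k", "w", "j"]

# Restrictions stated in the key.
NOT_AFTER_LABIALIZED = {"ɨ", "ɔ", "o", "u", "ɨə", "uə"}
NOT_BEFORE_J = {"i", "iə", "e", "ɛ"}
NOT_BEFORE_W = {"əː", "ɔ", "o", "u", "uə"}


def is_possible_rime(vowel, coda, labialized_onset=False):
    """Return True if vowel + coda is allowed; coda=None means an open syllable."""
    if vowel in SHORT_VOWELS and coda is None:
        return False
    if labialized_onset and vowel in NOT_AFTER_LABIALIZED:
        return False
    if coda == "j" and vowel in NOT_BEFORE_J:
        return False
    if coda == "w" and vowel in NOT_BEFORE_W:
        return False
    return True


def count_rimes(labialized_onset=False):
    """Count the vowel + coda combinations that pass the filter (tone ignored)."""
    vowels = LONG_VOWELS + SHORT_VOWELS
    codas = [None] + CODAS
    return sum(
        is_possible_rime(v, c, labialized_onset) for v in vowels for c in codas
    )
```

Multiplying such rime counts by the number of onsets of each type (and by whatever tonal distinctions are counted) would be one way to arrive at a possible figure per shape, though the exact procedure used for the tables may differ.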
pct_SV_shape and pct_NSV_shape are the percentages of the possible number of syllables of this shape that occur in the SV or NSV lists, respectively. pct_poss_shape is simply the sum of pct_SV_shape and pct_NSV_shape.
pct_poss_total is the sum of the SV and NSV counts for this shape, divided by the total sum of the possible column (17,526).
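As a worked illustration of the column definitions above, the following sketch computes the four percentage columns for one shape. The function name and the example counts are hypothetical, and scaling pct_poss_total by 100 is an assumption based on the column name; only the total of 17,526 comes from the notes.

```python
TOTAL_POSSIBLE = 17_526  # sum of the possible column, as stated in the key


def shape_percentages(possible, sv, nsv, total_possible=TOTAL_POSSIBLE):
    """Compute the percentage columns for a single syllable shape.

    possible: count of possible syllables of this shape
    sv, nsv:  counts of attested syllables in the SV and NSV lists
    """
    pct_sv = 100 * sv / possible
    pct_nsv = 100 * nsv / possible
    return {
        "pct_SV_shape": pct_sv,
        "pct_NSV_shape": pct_nsv,
        "pct_poss_shape": pct_sv + pct_nsv,
        # Assumed to be a percentage of the grand total of possible syllables.
        "pct_poss_total": 100 * (sv + nsv) / total_possible,
    }


# Made-up counts, for illustration only.
row = shape_percentages(possible=200, sv=30, nsv=50)
```

With these made-up counts, pct_SV_shape is 15, pct_NSV_shape is 25, and pct_poss_shape is their sum, 40.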
Takeaways:
Trần & Vallée 2009 report that “the prevalent monosyllabic pattern in Vietnamese…was the CVC syllable type, respectively 70% and 34% of the monosyllabic words, and respectively 70% and 20% of the language syllable inventory” (2009:232). Their counts were derived from a list of words with frequency above 2% in a 5,000-word lexicon. If we collapse the above table into their three categories (CV, CVC, CCVC), we see that the numbers are quite close: about 21% C(C)V, 71% CVC and 8% CCVC.